feature: unify new_tokens format sample state to trtllm sampler tokens format #5513
Conversation
…kens format Signed-off-by: Netanel Haber <[email protected]>
PR_Github #10145 [ skip ] triggered by Bot
PR_Github #10145 [ skip ] completed with state
Please ignore the skip, I triggered it by mistake on this PR.
Force-pushed from 6874175 to 035e67a
Signed-off-by: Netanel Haber <[email protected]>
minimize diff
Signed-off-by: Netanel Haber <[email protected]>
minimize diff
Signed-off-by: Netanel Haber <[email protected]>
Force-pushed from 6397d52 to 84138c6
…_with_trtllm_sampler_sample_state Signed-off-by: Netanel Haber <[email protected]>
/bot run --disable-fail-fast
PR_Github #10242 [ run ] triggered by Bot
PR_Github #10242 [ run ] completed with state
wili-65535
left a comment
Great work on simplifying the samplers! LGTM on my side.
…_with_trtllm_sampler_sample_state Signed-off-by: Netanel Haber <[email protected]>
…sampling Signed-off-by: Netanel Haber <[email protected]>
Force-pushed from a68c9fd to 051fe4a
…sampling Signed-off-by: Netanel Haber <[email protected]>
/bot run --disable-fail-fast
PR_Github #10369 [ run ] triggered by Bot
PR_Github #10369 [ run ] completed with state
dcampora
left a comment
Approving, as perf issue is now fixed.
suyoggupta
left a comment
LGTM for AD changes
Pull Request Overview
This PR re-applies previously reverted speculative decoding changes and fixes a performance regression in TorchSampler by unifying the new_tokens state format and refactoring sampler interfaces across the codebase.
- Refactored `get_spec_decoder` to accept `TorchSampler.Args` and integrated `TorchSampler` in speculative modes.
- Overhauled the `TorchSampler` API: introduced `Args`/`Store` dataclasses, generic sampling helpers, and unified `sample_async`/`update_requests`.
- Removed legacy sampler classes (`Eagle3Sampler`, `Eagle3Decoder`) and updated resource managers and the scheduler to use `all_requests`.
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tensorrt_llm/_torch/speculative/utils.py | Updated get_spec_decoder signature and error handling for unsupported modes. |
| tensorrt_llm/_torch/speculative/mtp.py | Adapted MTPSampler to new TorchSampler.Args and simplified stop‐criteria calls. |
| tensorrt_llm/_torch/speculative/eagle3.py | Removed legacy Eagle3 sampler classes, added Eagle3OneModelSampler. |
| tensorrt_llm/_torch/pyexecutor/seq_slot_manager.py | Switched loops to use scheduled_batch.all_requests(). |
| tensorrt_llm/_torch/pyexecutor/scheduler.py | Simplified all_requests to return a list instead of chain. |
| tensorrt_llm/_torch/pyexecutor/sampler.py | Major refactor of TorchSampler: new dataclasses, unified sampling functions, updated state. |
| tensorrt_llm/_torch/pyexecutor/py_executor.py | Propagated max_num_sequences, integrated SeqSlotManager, adjusted logit fields. |
| tensorrt_llm/_torch/pyexecutor/model_engine.py | Updated batch‐index logic (py_batch_idx) and input preparation to new sampler format. |
| tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py | Updated TorchSampler instantiation to use Args and added SeqSlotManager. |
Comments suppressed due to low confidence (3)
tensorrt_llm/_torch/pyexecutor/sampler.py:98
- Add unit tests for `top_k_sampling_batch`, `top_p_sampling_batch`, and the generic `sample` pipeline to validate sampling distributions, edge cases (e.g., top_k=1, top_p=0.0), and correct handling of tensor dimensions.
def top_k_sampling_batch(logits, top_k=50):
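As a point of reference for such tests, here is a minimal, hypothetical sketch of what a batched top-k sampler of this shape typically does (the actual implementation in `sampler.py` may differ in layout and details):

```python
import torch

def top_k_sampling_batch(logits: torch.Tensor, top_k: int = 50):
    """Sample one token per row, restricted to the top_k highest logits.

    Returns (next_tokens, softmax_probs). Sketch only; not the PR's code.
    """
    # Find the k-th largest logit per row and mask everything below it.
    values, _ = torch.topk(logits, min(top_k, logits.size(-1)), dim=-1)
    threshold = values[..., -1, None]
    masked = logits.masked_fill(logits < threshold, float("-inf"))
    # Renormalize over the surviving candidates and sample.
    probs = torch.softmax(masked, dim=-1)
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return next_tokens, probs
```

Note the top_k=1 edge case mentioned above degenerates to greedy argmax, which makes it a convenient deterministic test.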
tensorrt_llm/_torch/pyexecutor/sampler.py:180
- [nitpick] Add a docstring explaining this helper's purpose, the expected format of `strategy` and `logits`, and what is returned (next_tokens and softmax probabilities).
def sample(strategy: Strategy, logits: torch.Tensor):
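To illustrate the kind of dispatch-and-return contract such a docstring would describe, here is a hedged sketch; the strategy encoding (a tuple whose first element names the method) is an assumption, not necessarily the PR's actual `Strategy` type:

```python
import torch

# Hypothetical encoding: ("greedy",), ("top_k", 50), etc.
Strategy = tuple

def sample(strategy: Strategy, logits: torch.Tensor):
    """Dispatch on `strategy` and return (next_tokens, softmax_probs)."""
    probs = torch.softmax(logits, dim=-1)
    if strategy[0] == "greedy":
        return probs.argmax(dim=-1), probs
    if strategy[0] == "top_k":
        k = min(strategy[1], probs.size(-1))
        # Sample among the k most probable tokens, then map back to vocab ids.
        values, indices = torch.topk(probs, k, dim=-1)
        picked = torch.multinomial(values / values.sum(-1, keepdim=True), 1)
        return indices.gather(-1, picked).squeeze(-1), probs
    raise ValueError(f"unknown sampling strategy: {strategy}")
```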
tensorrt_llm/_torch/speculative/utils.py:113
- [nitpick] Document this exception in the `get_spec_decoder` docstring so callers know it will raise for unknown modes, or consider returning `None` to match previous behavior if that was expected.
f"Unsupported speculative decoding mode: {spec_config.spec_dec_mode}")
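The raise-versus-return-None tradeoff can be sketched with a small registry-style stand-in (the registry and factory names here are illustrative only, not the PR's actual code):

```python
# Hypothetical registry: speculative decoding mode name -> sampler factory.
_DECODER_FACTORIES = {
    "mtp": lambda sampler_args: ("MTPSampler", sampler_args),
    "eagle3_one_model": lambda sampler_args: ("Eagle3OneModelSampler", sampler_args),
}

def get_spec_decoder(sampler_args, spec_dec_mode: str):
    try:
        return _DECODER_FACTORIES[spec_dec_mode](sampler_args)
    except KeyError:
        # Fail loudly at construction time instead of silently returning None,
        # so a misconfigured mode surfaces immediately rather than downstream.
        raise ValueError(
            f"Unsupported speculative decoding mode: {spec_dec_mode}")
```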
@property
def all_requests(self) -> chain[LlmRequest]:
    return chain(self.context_requests, self.generation_requests)

changed in this PR to:

def all_requests(self) -> list[LlmRequest]:
Copilot AI, Jun 30, 2025
[nitpick] Consider returning an Iterable[LlmRequest] or Sequence[LlmRequest] instead of forcing a new list allocation on each call, or change the return annotation to list explicitly to reflect that behavior.
Suggested change (current line, then suggestion):

def all_requests(self) -> list[LlmRequest]:
def all_requests(self) -> Sequence[LlmRequest]:
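The tradeoff behind this nitpick is that `itertools.chain` is a single-use iterator, while a list costs one allocation per call but supports repeated iteration and `len()`. A small demonstration (the request values are placeholders):

```python
from itertools import chain

ctx, gen = ["req0", "req1"], ["req2"]

# A chain is exhausted after one pass.
it = chain(ctx, gen)
first_pass = list(it)
second_pass = list(it)  # empty: the iterator has been consumed

# A concrete list survives repeated iteration and supports len(),
# at the cost of a new allocation each time the property is called.
all_requests = ctx + gen
```

Annotating with `Sequence[LlmRequest]` would keep the implementation free to change while still promising callers a re-iterable, sized result.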
new_tokens,
gen_logits_host=gen_logits_host,
log_probs_host=log_probs_host)
new_tokens_host = new_tokens.to(device="cpu", non_blocking=True)
Copilot AI, Jun 30, 2025
Transferring the entire new_tokens tensor to CPU each iteration can be costly. If only a subset of slots is active, consider slicing new_tokens to only copy relevant indices and reduce data movement overhead.
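A sketch of the slicing idea, assuming a slot-major layout (one row per slot); the real tensor layout and slot bookkeeping in `sampler.py` may differ:

```python
import torch

def copy_active_tokens(new_tokens: torch.Tensor, active_slots: list) -> torch.Tensor:
    """Copy only the rows for active slots to host memory.

    Gathering first shrinks the device-to-host transfer from the whole
    new_tokens tensor to just the slots that are actually in use.
    """
    idx = torch.tensor(active_slots, dtype=torch.long, device=new_tokens.device)
    active = new_tokens.index_select(0, idx)
    return active.to(device="cpu", non_blocking=True)
```

Whether this wins in practice depends on how sparse the active-slot set is; for a mostly full batch the single bulk copy may still be cheaper than the gather.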
…state to trtllm samper tokens format (NVIDIA#5513) 58a8a8f - these changes were previously merged to main here. 6aef149 - the changes were temporarily reverted in main, due to a significant perf regression in models using the TorchSampler (observed by @byshiue). This PR is meant to re-merge these changes along with a fix to prevent the regression. The first commit of this PR is actually just the reverted revert - filter it out of the changes to see previously unmerged changes. Signed-off-by: Netanel Haber <[email protected]>
58a8a8f - these changes were previously merged to main here.
6aef149 - the changes were temporarily reverted in main, due to a significant perf regression in models using the `TorchSampler` (observed by @byshiue).
This PR is meant to re-merge these changes along with a fix to prevent the regression.
The first commit of this PR is actually just the reverted revert - filter it out of the changes to see the previously unmerged changes.